Introduction

Sir Arthur Ignatius Conan Doyle (22 May 1859 – 7 July 1930) was a British writer and physician. He created the character Sherlock Holmes in 1887 for A Study in Scarlet, the first of four novels and fifty-six short stories about Holmes and Dr. Watson. The Sherlock Holmes stories are milestones in the field of crime fiction.

Referring to himself as a “consulting detective” in the stories, Holmes is known for his proficiency with observation, deduction, forensic science, and logical reasoning that borders on the fantastic, which he employs when investigating cases for a wide variety of clients, including Scotland Yard.

The character and stories have had a profound and lasting effect on mystery writing and popular culture as a whole, with the original tales as well as thousands written by authors other than Conan Doyle being adapted into stage and radio plays, television, films, video games, and other media for over one hundred years.

In this presentation we will analyze the four novels using text-mining techniques to learn about the language used, how homogeneous it is across the novels, and how it is classified through sentiment analysis.

Short summary of the novels:

A Study in Scarlet. The story marks the first appearance of Sherlock Holmes and Dr. Watson, who would become the most famous detective duo in literature. The book’s title derives from a speech given by Holmes, a consulting detective, to his friend and chronicler Watson on the nature of his work, in which he describes the story’s murder investigation as his “study in scarlet”: “There’s the scarlet thread of murder running through the colourless skein of life, and our duty is to unravel it, and isolate it, and expose every inch of it.”

The Sign of the Four. As a dense yellow fog swirls through the streets of London, a deep melancholy has descended on Sherlock Holmes, who sits in a cocaine-induced haze at 221B Baker Street. His mood is only lifted by a visit from a beautiful but distressed young woman - Mary Morstan, whose father vanished ten years before. Four years later she began to receive an exquisite gift every year: a large, lustrous pearl. Now she has had an intriguing invitation to meet her unknown benefactor and urges Holmes and Watson to accompany her. And in the ensuing investigation - which involves a wronged woman, a stolen hoard of Indian treasure, a wooden-legged ruffian, a helpful dog and a love affair - even the jaded Holmes is moved to exclaim, ‘Isn’t it gorgeous!’

The Hound of the Baskervilles. The Hound of the Baskervilles is the third of the four crime novels written by Sir Arthur Conan Doyle featuring the detective Sherlock Holmes. Originally serialised in The Strand Magazine from August 1901 to April 1902, it is set largely on Dartmoor in Devon in England’s West Country and tells the story of an attempted murder inspired by the legend of a fearsome, diabolical hound of supernatural origin. Sherlock Holmes and his companion Dr. Watson investigate the case. This was the first appearance of Holmes since his apparent death in “The Final Problem”, and the success of The Hound of the Baskervilles led to the character’s eventual revival.

The Valley of Fear. Doyle’s final novel featuring the beloved sleuth, Sherlock Holmes, brings the detective and his friend to a country manor where a death, either a murder or a suicide, has occurred before their arrival. A card with the initials VV 341 has been left by the body, and discovering the facts of the case gets ever more difficult. The answers to this mystery lie far away from the scene of the crime and across the Atlantic, in a place known as ‘The Valley of Fear’. A secretive organization is the culprit, and an infiltration of it is in order.

Text mining

Frequency analysis with wordcloud

This is a wordcloud: a collection, or cluster, of words depicted in different sizes. The bigger and bolder a word appears, the more often it is mentioned within the text. As you can see, the biggest words are “Holmes”, the name of the main character, and “sir”, a title used in the Victorian era when referring to another man. Finally, “house”, “time”, “hand”, “eyes”, and “night” are important words related to the murders: the house is the victim’s or Holmes’s, the time is when the crime happened, the hand is what holds the murder weapon, the eyes belong to the witnesses, and the night is the time of day when most of the murders happen.
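The word frequencies that a wordcloud is drawn from can be sketched with a simple counting pass. A minimal sketch, assuming a plain-text string `text` standing in for one of the novels and a toy stop-word list (a real analysis would use a full one):

```python
import re
from collections import Counter

# Hypothetical stand-in for the full text of a novel.
text = "Holmes glanced at the house. Holmes raised his hand; the night was dark."

# A few stop words; a real analysis would use a complete stop-word list.
stop_words = {"the", "at", "his", "was", "a", "of", "and"}

# Lowercase, tokenize on word characters, drop stop words, then count.
tokens = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_words]
freq = Counter(tokens)

print(freq.most_common(3))  # these counts determine the wordcloud sizes
```

The `most_common` counts map directly to font sizes in the cloud.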

Bigram Frequency

In this case I consider bigrams. This is a plot in which each point corresponds to a bigram (you can check which one by hovering with the mouse). It is interesting that the occurrences of the bigram “Sherlock Holmes” decline over the novels, from 50 times in the first book to only 11 in the last one. What we can observe is that the bigrams with a significant number of occurrences are names such as “Sir Charles” or “Miss Morstan”. The dashed line is the boundary between the top 10 bigrams and the others.
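Counting bigrams only requires sliding a window of two over the token stream. A minimal sketch over a hypothetical token list:

```python
from collections import Counter

# Hypothetical token stream from one of the novels.
tokens = ["sherlock", "holmes", "said", "sherlock", "holmes", "smiled"]

# Pair each token with its successor to form bigrams, then count them.
bigrams = Counter(zip(tokens, tokens[1:]))

print(bigrams[("sherlock", "holmes")])
```

Repeating this per novel gives the per-book occurrence counts plotted above.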

Bigrams as a network

We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few divided by book. We can arrange the words into a network, or “graph”.

  • from: the first word of the bigram
  • to: the second word of the bigram
  • color of the arrow: the number of times the bigram occurs (the more occurrences, the more opaque)

I excluded the bigrams that occurred fewer than 9 times in order to examine only the important ones. For the most part, as before, we see names of characters and places; occasionally some objects appear, such as the “wedding ring” in The Valley of Fear.
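Building the edge list for such a network amounts to filtering the bigram counts and reading each surviving bigram as a directed edge. A sketch with illustrative, made-up counts:

```python
from collections import Counter

# Hypothetical bigram counts, as produced in the previous step.
bigram_counts = Counter({
    ("sherlock", "holmes"): 50,
    ("sir", "charles"): 21,
    ("wedding", "ring"): 9,
    ("said", "he"): 5,
})

# Keep only bigrams occurring at least 9 times, laid out as directed
# edges: from = first word, to = second word, weight = occurrence count.
edges = [
    {"from": a, "to": b, "weight": n}
    for (a, b), n in bigram_counts.items()
    if n >= 9
]

for e in sorted(edges, key=lambda e: -e["weight"]):
    print(f'{e["from"]} -> {e["to"]} ({e["weight"]})')
```

The weight would then drive the arrow’s opacity in the plotted graph.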

tf-idf

What are the highest tf-idf words in these four novels? The tf-idf statistic identifies words that are important to a document within a collection of documents; in this case, we’ll see which words are important in one of the novels compared to the others. What measuring tf-idf has shown here is that Arthur Conan Doyle used similar language across his four novels, and what distinguishes one novel from the rest within the collection of his works are the proper nouns: the names of people and places.
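The computation behind this can be sketched from scratch: term frequency within a novel times the log of how few novels contain the word. A minimal sketch over hypothetical toy "novels":

```python
import math
from collections import Counter

# Hypothetical tiny corpus: one short token list per "novel".
docs = {
    "scarlet": ["holmes", "watson", "scarlet", "murder"],
    "sign":    ["holmes", "watson", "treasure", "pearl"],
    "hound":   ["holmes", "baskerville", "hound", "moor"],
    "valley":  ["holmes", "valley", "fear", "murder"],
}

def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: in how many novels each word appears.
    df = Counter(w for tokens in docs.values() for w in set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            w: (c / total) * math.log(n_docs / df[w])  # tf * idf
            for w, c in counts.items()
        }
    return scores

scores = tf_idf(docs)
# "holmes" appears in every novel, so its idf (hence tf-idf) is zero,
# while a proper noun like "baskerville" scores high in its own book.
print(scores["hound"]["holmes"], scores["hound"]["baskerville"])
```

This mirrors the finding above: words shared by all four novels get a tf-idf of zero, and the distinguishing words are the proper nouns.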

For the sake of completeness we apply the tf-idf technique to the bigrams as well, and we see that most of them contain the unigrams that we found previously.

Sentiment analysis

Sentiment analysis on the four novels

Which are the main sentiments in the books? These are detective novels, so I expect negative sentiments to be the most present, since they deal with murders and mysteries. We will use the Bing sentiment lexicon, summing the occurrences of the words belonging to each sentiment in order to highlight this; colors distinguish negative and positive sentiments. As expected, all the novels contain many negative words; we could say that only about a third of the sentiment-bearing words are positive.
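The tally behind that plot is a lexicon lookup followed by a count. A minimal sketch, using a made-up handful of entries standing in for the Bing lexicon:

```python
from collections import Counter

# Hypothetical miniature stand-in for the Bing lexicon (word -> polarity).
bing = {"murder": "negative", "dark": "negative", "fear": "negative",
        "dead": "negative", "love": "positive", "great": "positive"}

# Hypothetical token stream; words absent from the lexicon are ignored.
tokens = ["murder", "dark", "love", "fear", "dead", "great", "the", "moor"]

# Sum occurrences per sentiment, as in the bar plot.
sentiment_counts = Counter(bing[w] for w in tokens if w in bing)
print(sentiment_counts)
```

Even in this toy stream the split comes out two-to-one negative, the same rough proportion observed in the novels.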

Negative vs positive

Then I plot the top 10 words that contribute the most to the overall negative and positive sentiment. As we can see, only two of the positive words exceed 50 occurrences, unlike all of the top negative ones.

Top sentiment contribution

We now want to explore the top sentiment words for each of the novels and how much they contribute to defining the sentiment of the book. For this we will use the AFINN sentiment lexicon: in this way each word is not merely positive or negative (-1, +1) but carries a weight from -5 to +5. A word’s contribution is its sentiment value times its number of occurrences, divided by the total number of words.
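That contribution formula can be sketched directly. A minimal sketch, with a made-up handful of entries standing in for the AFINN lexicon:

```python
from collections import Counter

# Hypothetical mini AFINN lexicon: word -> weight in [-5, +5].
afinn = {"murder": -3, "fear": -2, "treasure": 2, "love": 3}

# Hypothetical token stream from one novel.
tokens = ["murder", "murder", "fear", "treasure", "love", "the", "moor"]
counts = Counter(tokens)
total = len(tokens)

# contribution = AFINN value * occurrences / total number of words
contribution = {
    w: afinn[w] * n / total for w, n in counts.items() if w in afinn
}
top = sorted(contribution, key=lambda w: abs(contribution[w]), reverse=True)
print(top[0], round(contribution[top[0]], 3))
```

Ranking by absolute contribution surfaces the strongest sentiment words of either sign, as in the per-novel plots.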

We can see that most of the words are what we would expect from the detective-novel genre, except in The Sign of the Four, where the top positive and negative words are scored as such by the lexicon, but the first is actually used to refer to the treasure behind the main mystery of the story, and the second is used as a title for Miss Mary Morstan, a very important character who will become the wife of Dr. Watson.

False friends

It is not surprising that there are “false friends”: positive or negative words preceded by a negation, which reverses their meaning. For that reason I decided to explore all the bigrams with “not” as the first word, and then calculated the contribution of each word as the product of its sentiment value and its number of occurrences. Lastly, I plotted the top 20 words preceded by “not”, ranked by their absolute contribution.

The bigrams “not help” and “not wish” were overwhelmingly the largest causes of misidentification, making the text seem much more positive than it is. Conversely, phrases like “not fear” and “not ashamed” sometimes make the text seem more negative than it is.

“Not” isn’t the only term that provides context for the following word. We could pick six common words (“not”, “without”, “don’t”, “never”, “won’t”, “no”) that negate the subsequent term, and use the same joining and counting approach to examine all of them at once. While “not help” is still one of the most common examples, we can also see pairings such as “no doubt”, “no harm”, “no great” and “no good”. We observe a slight abundance of negative false friends, but considering that the novels have a decisively negative sentiment, we are not too concerned.
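The joining-and-counting step for negated words can be sketched as a filter over the bigram pairs. A minimal sketch, again with a made-up mini AFINN lexicon:

```python
from collections import Counter

# Hypothetical mini AFINN lexicon and the six negating words.
afinn = {"help": 2, "wish": 1, "doubt": -1, "fear": -2, "good": 3}
negators = {"not", "without", "don't", "never", "won't", "no"}

# Hypothetical token stream.
tokens = ["not", "help", "no", "doubt", "not", "help", "not", "fear", "good"]

# Count (negator, word) bigrams where the second word carries sentiment.
negated = Counter(
    (a, b) for a, b in zip(tokens, tokens[1:])
    if a in negators and b in afinn
)

# Contribution of each negated word: sentiment value * occurrences.
# A large positive value here means the text was made to look more
# positive than it really is (the negation flipped the meaning).
contribution = {pair: afinn[pair[1]] * n for pair, n in negated.items()}
print(contribution)
```

Sorting these contributions by absolute value reproduces the “false friends” ranking described above.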

Correlation between the novels

Which words differ between novels? Which ones are used in every book?

Now, let’s calculate the frequency for each word across the entire Sherlock Holmes series versus within each book. This will allow us to compare strong deviations of word frequency within each book as compared to across the entire series.
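Both frequencies are simple proportions: a word’s count divided by the total number of words, computed once per book and once over the pooled series. A minimal sketch with hypothetical token lists:

```python
from collections import Counter

# Hypothetical token lists, one per novel (heavily abbreviated).
books = {
    "scarlet": ["holmes", "watson", "scarlet", "holmes"],
    "hound":   ["holmes", "baskerville", "baskerville", "moor"],
}

def proportions(tokens):
    """Frequency (proportion) of each word within a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# Per-book frequencies, and series-wide frequencies over the pooled text.
per_book = {name: proportions(toks) for name, toks in books.items()}
series = proportions([w for toks in books.values() for w in toks])

# Plotting per_book vs. series puts words with similar usage everywhere
# (like "holmes") near the diagonal line.
print(per_book["scarlet"]["holmes"], series["holmes"])
```

Words whose two proportions diverge are the ones far from the line in the scatterplots.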

Words that are close to the line in these plots have similar frequencies across all the novels. For example, words such as “holmes” and “house” are fairly common and used with similar frequencies across most of the books. Words that are far from the line are found more in one set of texts than another. Furthermore, words standing out above the line are common across the series but not within that book, whereas words below the line are common in that particular book but not across the series. For example, “watson” stands out above the line in “A Study in Scarlet”: it is fairly common across the entire Sherlock Holmes series but is not used as much in that novel. In contrast, a word below the line such as “baskerville” in “The Hound of the Baskervilles” is common in that novel but far less common across the series.

Examining Pairwise Correlation between sections

Tokenizing by n-gram, as we saw previously, is a useful way to explore pairs of adjacent words. However, we may also be interested in words that tend to co-occur within particular documents or particular chapters, even if they don’t occur next to each other.

So we divide each book into sections of 10 rows each. Now we want to examine correlation among words, which indicates how often they appear together relative to how often they appear separately. In particular, we’ll focus on the Pearson correlation, which for binary data is equivalent to the phi coefficient, a common measure of binary correlation. The focus of the phi coefficient is how much more likely it is that both word X and word Y appear, or neither does, than that one appears without the other.
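The phi coefficient can be sketched from the 2x2 presence/absence table of two words over the sections. A minimal sketch with hypothetical sections:

```python
import math

# Hypothetical sections: each is the set of words appearing in 10 rows.
sections = [
    {"holmes", "lantern", "light"},
    {"holmes", "watson"},
    {"lantern", "light", "moor"},
    {"watson", "moor"},
]

def phi(word_x, word_y, sections):
    """Phi coefficient between two words over binary presence/absence."""
    n11 = sum(1 for s in sections if word_x in s and word_y in s)
    n10 = sum(1 for s in sections if word_x in s and word_y not in s)
    n01 = sum(1 for s in sections if word_x not in s and word_y in s)
    n00 = sum(1 for s in sections if word_x not in s and word_y not in s)
    num = n11 * n00 - n10 * n01
    den = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return num / den

# "lantern" and "light" appear together or not at all, so phi = 1.
print(phi("lantern", "light", sections))
```

Words that always co-occur within sections get phi = 1; words that never share a section get a negative phi.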

Let’s plot the top four words that correlate with “holmes”, “house”, “sir”, and “time”, the four most frequent words.

Just as we used a graph to visualize bigrams, we can use it to visualize the correlations and clusters of words that we found. Note that unlike the bigram analysis, the relationships here are symmetrical rather than directional (there are no arrows). While the pairings of names and titles that dominated the bigram pairings are still common, such as “sherlock/holmes” or “sir/henry”, we also see pairings of words that merely appear close to each other, such as “common” and “sense”, or “lantern” and “light”. The four most common words are highlighted in red; we notice the absence of “time” and “house”, which means that although they are used very often, they are scattered across all the sections rather than concentrated in specific ones.

Topic modeling

Comparison of the topics of the four novels, have they some topics in common?

We will use Latent Dirichlet allocation (LDA), a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.

We first use the model fitted by LDA to extract the per-topic-per-word probabilities, called \(\beta\) (“beta”).

Now let’s visualize the 10 most common words in each topic extracted from the books. Both topics seem to include general words: the first seems more about the witnesses and the places where the bodies were found, while the second seems to be about the murder itself and the time at which it occurred.

Which topic defines each book?

Besides estimating each topic as a mixture of words, LDA assigns the probability that each book is generated from each topic. We can examine these per-document-per-topic probabilities, called \(\gamma\) (“gamma”).

In this case each novel is roughly an even 50/50 mixture of the two topics. This reinforces what we have seen so far: the novels use similar language, and in a similar way.

Conclusions

This analysis shows that, through data science, we can extract enough information to understand a text reasonably precisely and uncover its secrets. Starting from simple text-mining techniques, we analyzed word frequencies and the most related words in order to form a rough idea of the topics of the books. We then explored the sentiment of each book and learned that one has to look at groups of words, and the relations between them, to truly extract the information behind the text. Finally, we studied the books more deeply with topic modeling in order to better understand the topics of the texts.